home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
C/C++ Interactive Reference Guide
/
C-C++ Interactive Reference Guide.iso
/
c_ref
/
csource3
/
167_01
/
fgrep.doc
< prev
next >
Wrap
Text File
|
1985-08-21
|
28KB
|
739 lines
/*
HEADER: CUG
TITLE: Parallel Pattern Matching and FGREP
VERSION: 1.00
DATE: 09/20/85
DESCRIPTION: "Development of algorithm used in FGREP, a full
emulation of Unix's 'fgrep' utility."
KEYWORDS: fgrep, grep, filter, UNIX, pattern-matching
SYSTEM:
FILENAME: FGREP.DOC
WARNINGS:
CRC: xxxx
SEE-ALSO: FGREP.C
AUTHORS: Ian Ashdown - byHeart Software
COMPILERS:
REFERENCES: AUTHORS: Bell Telephone Laboratories;
TITLE: UNIX Programmer's Manual Vol. 1, p. 166;
AUTHORS: A.V. Aho & M.J. Corasick;
TITLE: 'Efficient String Matching: An Aid to
Bibliographic Search'
Communications of the ACM
pp. 333 - 340, Vol. 18 No. 6 (June '75);
ENDREF
*/
/*-------------------------------------------------------------*/
Ian Ashdown
byHeart Software
1089 West 21st Street
North Vancouver
British Columbia V7P 2C6
Canada
Date: April 12th, 1985
Parallel Pattern Matching and FGREP
-----------------------------------
by Ian Ashdown
byHeart Software
*****************************************************************
* *
* NOTE: This manuscript was published in the September 1985 *
* issue of Doctor Dobb's Journal. It may be reproduced *
* for personal, non-commercial use only, provided that *
* the above copyright notice is included in all copies. *
* Copying for any other use without previously obtaining *
* the written permission of the author is prohibited. *
* *
*****************************************************************
Preparing an index to a large technical book. Finding all
references to a person in several years' worth of a company's
minutes of meetings. Searching for all occurrences of a specific
sequence of events in the volumes of data obtained from a
scientific experiment. All of these problems and many more are
examples of searching for patterns in a set of data. The question
is, what is the most efficient search algorithm to use?
For single patterns, such as searching for the phrase
"binary tree" in a text file, there are two of particular
interest: the Boyer-Moore Algorithm (reference 5) and the
Knuth-Morris-Pratt Algorithm (reference 6). Both are fast and
reasonably simple to implement. However, neither is appropriate
when more than one pattern at a time must be searched for.
One algorithm that is appropriate can be found in a paper
published in the June '75 issue of "Communications of the ACM"
(reference 1). Entitled "Efficient String Matching: An Aid to
Bibliographic Search" and written by Alfred Aho and Margaret
Corasick of Bell Laboratories, the paper presents "a simple and
efficient algorithm to locate all occurrences of any of a finite
number of keywords in a string of text". Without modification to
the algorithm, a "keyword" can be taken to mean any pattern, and
"a string of text" to mean any sequence of symbols, be it ASCII
text, digitized data obtained from a video camera, or whatever.
The algorithm has some interesting features. For
instance, most pattern matching schemes must employ some form of
backtracking, or rescanning of the string when an attempted match
to a pattern fails or multiple patterns are to be matched. The
Aho-Corasick algorithm differs in that it searches for all of the
patterns in parallel. By doing so, it can process the string in
one pass without any backtracking.
The algorithm also recognizes patterns that overlap in
the string. For example, if the patterns are "he", "she" and
"hers", the algorithm will correctly identify matches to all
three in the string "ushers".
As to speed of execution, it is independent of the number
of patterns to be matched! This perhaps surprising feature is a
direct result of the patterns being searched for in parallel.
Curiously, this algorithm has all but disappeared from
the literature of computer science. Of the hundreds of textbooks
and articles written since that discuss pattern matching, only a
few reference the paper, and none that I know of present the
Aho-Corasick algorithm itself. Unless you have access to back
issues of "Communications of the ACM", you will likely never
encounter it.
On the other hand, if you work with UNIX you may have
used a utility based on the algorithm: "fgrep". This useful tool
searches for and displays all occurrences of a set of keywords
and phrases in one or more text files. While it does not accept
wildcard characters in its patterns like UNIX's more flexible
"grep" utility, it does illustrate the speed of the Aho-Corasick
algorithm in comparison with other pattern matching methods.
Typically, "fgrep" is five to ten times faster than "grep" in
searching files for fixed string patterns.
Given its general usefulness, I thought it appropriate to
reintroduce Aho and Corasick's work in this article. At the same
time, it seemed a shame that "fgrep" has been restricted to the
UNIX operating system. Therefore, the source code in C for a full
implementation (reference 4) of "fgrep" has been included. This
serves not only to demonstrate the algorithm, but also to bring
the idea of a public domain UNIX one small step closer to
reality.
Inside The Machine
------------------
The Aho-Corasick algorithm consists of two phases:
building a "finite state automaton" (FSA for short), then running
a string of data through it, with each consecutive symbol being
considered a separate input. Before analyzing these phases, a
quick summary of what finite state automata are and how they work
is in order. (For a more detailed discussion, reference 2 is
recommended.)
In essence, an FSA is a conceptual machine that can be in
any one of a finite number of "states". It also has a set of
"state transition" rules and a set of "inputs", which together
with the set of states define what inputs cause the machine to
change from one state to another. Finally, since a machine serves
no purpose without producing some output, an FSA has one or more
"terminal" states. Output is produced only when the FSA enters
such states.
A physical example of an FSA could be a light switch.
This device has two states, ON and OFF. Its inputs consist of
someone pushing the switch UP or DOWN. ON is a terminal state,
where the associated lamp produces visible radiation. The state
transition rules can be stated in a simple truth table:
+------------------+
| | OFF | ON |
|----|------|------|
| UP | ON | -- |
|DOWN| -- | OFF |
+------------------+
Alternatively, this could be shown as a "state transition
diagram":
DOWN
+-------------+
| |
V |
+-----+ UP ******
+-->| OFF |------>* ON *<---+
| +-----+ ****** |
| | | |
| DOWN | | UP |
+------+ +------+
where no state transition on a particular input (as shown in the
truth table) is equivalent to a transition from a state to itself
(as shown in the diagram).
Getting ahead of ourselves for a moment, look at Figure
1a, which shows part of an FSA (the other parts are shown in
Figures 1b and 1c, and will be explained shortly). By having
sequences of states, it is possible for the FSA to recognize
patterns in an input stream. For example, presenting the string
"she" to the FSA shown in Figure 1a would cause it to change from
State 0 to States 3, 4 and 5 as each input symbol of the string
is processed. State 5 is a terminal state that indicates the
pattern "she" was recognized.
There are two types of FSA that can be built, one
"nondeterministic" (Algorithm 1) and the other "deterministic"
(Algorithm 2). The nondeterministic one (NFSA for short) uses
functions called "go_to", "failure" and "output", while the
deterministic FSA (DFSA) uses "move" and "output". The difference
between the two is that while the NFSA can make one or more state
transitions per input symbol, the DFSA makes only one. With fewer
transitions to make, a DFSA can process its input in less time
than an equivalent NFSA.
The "go_to" function accepts as input parameters the
current state of the NFSA and the current input symbol, and
returns either a state number (the "go_to" state transition) if
the input symbol matches that expected by the patterns being
matched, or else a unique symbol called FAIL. The only exception
to this is State 0 - "go_to(0,X)" returns a state number for all
input symbols "X". If symbol "X" does not match the beginning
symbol of any of the patterns, the state number returned is 0.
The "failure" function is called whenever the "go_to"
function returns FAIL. Accepting the current state as its input
parameter, it always returns a state number, the "failure" state
transition.
For the DFSA, the "move" function accepts the current
state number and input symbol, and always returns the next state
number. There are no failure state transitions in deterministic
FSA's.
Both the NFSA and the DFSA use the "output" function. As
defined in Aho and Corasick's paper, this function prints the
patterns as they are matched by the FSA. However, the definition
of this function can be much more general. In fact, the FSA can
be implemented to execute any arbitrary set of procedures
whenever it recognizes a pattern.
FSA's as used by the Aho-Corasick algorithm can be either
nondeterministic or deterministic. While a DFSA executes more
quickly than its equivalent NFSA, it also requires more memory to
encode its state transition tables. Which type is chosen depends
upon the application and the constraints of execution speed and
memory usage.
As an example, assume a set of text patterns to be
matched: "he", "she", "his" and "hers", with ASCII characters as
the set of input symbols. (These have been unabashedly purloined
from Aho and Corasick.) The FSA's needed to recognize these
patterns are shown in Figure 1.
Looking at the NFSA first: using as input the string
"ushers", the machine is run using Algorithm 1. Starting in State
0, the first symbol from the string is "u". Since only symbols
"h" and "s" lead from State 0 to other states, "go_to(0,u)"
returns 0 and so the NFSA remains in State 0.
The next symbol from the string is "s". Following Figure
1a, the NFSA makes a go_to state transition to State 3. The next
symbol ("h") causes a go_to transition to State 4, and the next
("e") to State 5. There is an output defined for State 5 (Figure
1c), and so the NFSA prints "she,he", having recognized two
pattern matches.
Continuing, the next character is "r". There are no go_to
transitions defined for State 5, and so "go_to(5,r)" returns
FAIL. This causes "failure(5)" to return 2 (from Figure 1b),
making the NFSA perform a failure transition to State 2. From
here, Algorithm 1 executes "go_to(2,r)", which causes the NFSA to
enter State 8.
Finally, the last input symbol, "s", leads the NFSA to
enter terminal State 9, and the output defined for this state
causes "hers" to be printed. In all, Algorithm 1 recognized the
patterns "he", "she" and "hers" in the string "ushers". The state
transitions made can be summarized as:
Input Symbol: u s h e r s
Current State: 0 0 3 4 5 2 8 9
Turning our attention to the DFSA and using the same
input string "ushers", this machine makes move state transitions
in accordance with Algorithm 2 and Figure 1d. Its state
transitions can be summarized as:
Input Symbol: u s h e r s
Current State: 0 0 3 4 5 8 9
The only difference is that the failure transition to State 2 was
skipped. By being able to avoid any failure transitions,
Algorithm 2 can recognize a pattern in fewer state transitions
and hence less time than Algorithm 1.
Software Construction
---------------------
Having seen how FSA's work, it is now time to see how
they are built, starting with the computation of the "go_to"
function by Algorithm 3.
The "go_to" function is initially defined to return FAIL
for every input symbol and every state. Each pattern is run
through procedure "enter", which tries to match the pattern
symbol by symbol to the existing partially-constructed "go_to"
function. When a failure occurs, a new state is created and added
to the "go_to" function, with the current symbol of the pattern
being the input for the state transition. Thereafter, each
consecutive symbol of the pattern causes a new state to be
created and added.
When the last symbol of each pattern has been processed
by procedure "enter", its associated state is made a terminal
state. As shown in Algorithm 3, this means that the output for
the state is defined as the current pattern, which Algorithm 1
will print when the NFSA is run. However, it is easy to see that
the statement "output(state) = pattern" in Algorithm 3 can be
rewritten to assign whatever procedures to "output(state)" that
you want.
Finally, after all of the patterns have been processed,
"go_to(0,X)" is redefined to return 0 for all symbols "X" that
still return FAIL.
When Algorithm 3 completes, it has computed the "go_to"
function of the FSA and also partially computed the "output"
function. If you simulate Algorithm 3 by hand with "he", "she",
"his" and "hers" as its patterns, you will see that the output
for State 5 will be "she", not "she,he". It remains for Algorithm
4 to complete the "output" function as it computes the "failure"
function.
Let's define the "depth" of a state as the number of
go_to state transitions that must be made from State 0 to reach
it. For example, the depth of State 4 in Figure 1a is 2, while
that of State 9 is 4. The failure state transition for all states
of depth 1 is State 0. The algorithm used to compute the failure
transitions for all states of depth greater than 1 can be
expressed in its simplest form as:
for each depth D do
for each state S of depth D-1 do
for each input symbol A do
if (T = go_to(S,A)) != FAIL then
begin
R = failure(S)
while go_to(R,A) == FAIL do
R = failure(R)
failure(T) = go_to(R,A)
end
To demonstrate this using our example, let's compute one
failure transition for a state of depth 2. The states of depth 1
are 1 and 3. The only input symbols for which the "go_to"
function does not return FAIL are "e" and "i" for State 1 and "h"
for State 3. Taking State 3 for "S" and symbol "h" for "A" above,
we have "T" being 4, "R" being "failure(3)", which is 0,
"go_to(0,h)" returning 1, and finally "failure(4)" being set to
"go_to(0,h)". The failure transition for State 4 is thus State 1.
Algorithm 4 expands on the above algorithm in two ways.
First, it uses a queue (a first-in, first-out list) to maintain
the order of states by depth to be processed. Second, it accepts
as input the "output" function and combines the outputs of the
terminal states where appropriate. In our example, the output
"he" of State 2 would be combined with the output "she" of State
5 to produce the final and correct "she,he" output for State 5.
The "move" function is computed from the "go_to" and
"failure" functions by means of Algorithm 5. Essentially, all
this algorithm does is precompute all possible sequences of
failure state transitions. It is also very similar to Algorithm
4. Since there is considerable redundancy between them, they can
be merged to compute the "failure" and "move" functions
concurrently. The result is shown as Algorithm 6.
Details, Details
----------------
So far, the functions "go_to", "failure", "move" and
"output" have been shown as something that you can simply assign
outputs to for specified inputs. This assumes a much higher-level
programming language than most of today's offerings. Implementing
these algorithms in C, Pascal, Basic or Fortran requires some
extra code and work.
Aho and Corasick programmed their original version in
Fortran. This is evident from their paper, where they suggest
that the "failure" and "output" functions could be implemented as
one-dimensional arrays accessed by state numbers. With a language
such as C that supports complex data structures, the code for the
functions can be more elegantly written by replacing the state
numbers and arrays with dynamically allocated structures. Each
structure can have as members pointers to linked lists of go_to
and move transitions, a pointer to the appropriate failure
transition, and a pointer to a character string for the output
function.
They also noted that the "go_to" and "move" functions
could be implemented as two-dimensional arrays, with the number
of states forming one dimension and the set of input symbols
forming the other. However, this would require enormous amounts
of memory for an ASCII character set and several hundred states.
Their recommendation was that the go_to and move transitions for
each state be implemented as linked linear lists or binary trees.
Since the FSA will usually spend most of its time in
State 0 for large input symbol sets, it is advantageous to have
the "go_to" or "move" functions execute as quickly as possible
for this state. Therefore, following Aho & Corasick's suggestion,
a one-dimensional array can be assigned to State 0. (Note that
since there are no failure transitions defined for State 0, its
go_to and move transitions are one and the same.) The array is
accessed directly, using the input symbol as the index, with the
corresponding array entry being the state transition for that
symbol.
An improvement not covered by Aho & Corasick can be made
when it realized that often several states will have the same
move transitions (see Figure 1d as an example). In such cases,
the state structures can be assigned a common pointer to one
linked list of move transitions.
Some applications may call for the algorithm to be coded
with predefined patterns in read-only memory. Here it is usually
desired to maximize the speed of execution while at the same time
minimizing the code size. Aho and Ullman (reference 2) discuss a
method of encoding the state transition tables that combines the
compactness of linked lists with the speed of direct array
access. Their method (which UNIX's "yacc" compiler-compiler
utility uses to encode its parsing tables) is unfortunately too
involved to discuss here. Interested readers are referred to Aho
and Ullman's book for details.
Putting It All To Work
----------------------
Implementing the Aho-Corasick algorithm is fairly
straightforward, although as you can see, the code required to
flesh it out to a full emulation of UNIX's "fgrep" is somewhat
involved. Regardless, the work is done for you. If you want to
use the algorithm in other programs, simply remove the "fgrep"
shell and add whatever interface routines you require.
An example: as an information retrieval utility that
searches arbitrary text files, "fgrep" is useful but by no means
complete. One useful extension would be the ability to specify
Boolean operators (AND, OR and EXCLUSIVE-OR) for combinations of
patterns.
Another one: rather than simply print matched patterns
for its output, the FSA can execute whatever procedures you
choose to associate with the patterns as they are matched. From
this you could create a macro processor, where the FSA accepts an
input string, finds matches to predefined patterns, substitutes
the corresponding macro expansions, and then emits the result as
an output string.
For those with RAM memory to spare (a megabyte or so),
consider how fast a spelling checker that makes no disk accesses
beyond initially loading itself could be made to run ... fifty
thousand or more words in RAM and your text file is checked
almost as fast as it can be read.
If you come up with an original and fascinating use for
the Aho-Corasick algorithm, or if you derive new utilities from
"fgrep", by all means let me know about it. Better yet, donate
your code to a public domain software group. That alone would
repay me for the effort I put into developing the code and
writing this article.
------------------------------------------------------
Algorithm 1: Nondeterministic Pattern Matching Machine
------------------------------------------------------
Input: A string of symbols and a pattern-matching machine
consisting of functions "go_to", "failure" and "output".
Output: Matched patterns.
begin
state = 0
while there are more symbols in the string do
begin
a = next symbol in the string
while go_to(state,a) == FAIL do
state = failure(state)
state = go_to(state,a)
if output(state) != NULL then
print output(state)
end
end
---------------------------------------------------
Algorithm 2: Deterministic Pattern Matching Machine
---------------------------------------------------
Input: A string of symbols and a pattern-matching machine
consisting of functions "move" and "output".
Output: Matched patterns.
begin
state = 0
while there are more symbols in the string do
begin
a = next symbol in the string
state = move(state,a)
if output(state) != NULL then
print output(state)
end
end
---------------------------------------------
Algorithm 3: Computation of Go_to Transitions
---------------------------------------------
Input: Array of patterns {y[1],y[2] .... y[k]} where each
pattern is a string (one-dimensional array) of symbols.
(Function "go_to(s,a)" is defined to return FAIL for all
states "s" and all input symbols "a" until defined
otherwise by procedure "enter".)
Output: Function "go_to" and partially computed function
"output".
begin
newstate = 0
for i = 1 to i = k do
enter(y[i])
for each symbol a do
if go_to(0,a) == FAIL then
go_to(0,a) = 0
end
procedure enter(pattern) /* "pattern" is an array of */
begin /* "m" symbols */
state = 0
j = 1
while go_to(state,pattern[j]) != FAIL do
begin
state = go_to(state,pattern[j])
j = j + 1
end
for p = j to p = m do
begin
newstate = newstate + 1
output(newstate) = NULL
go_to(state,pattern[p]) = newstate
state = newstate
end
output(state) = pattern
end
-----------------------------------------------
Algorithm 4: Computation of Failure Transitions
-----------------------------------------------
Input: Functions "go_to" and "output" from Algorithm 3.
Output: Functions "failure" and "output".
begin
initialize queue
for each symbol a do
if (r = go_to(0,a)) != 0 then
begin
add r to tail of queue
failure(r) = 0
end
while queue is not empty do
begin
s = head of queue
remove head of queue
for each symbol a do
if (t = go_to(s,a)) != FAIL then
begin
add t to tail of queue
r = failure(s)
while go_to(r,a) == FAIL do
r = failure(r)
failure(t) = go_to(r,a)
output(t) = output(t) + output(failure(t))
end
end
end
--------------------------------------------
Algorithm 5: Computation of Move Transitions
--------------------------------------------
Input: Function "go_to" from Algorithm 3 and function "failure"
from Algorithm 4.
Output: Function "move".
begin
initialize queue
for each symbol a do
begin
move(0,a) = go_to(0,a)
if (r = go_to(0,a)) != 0 then
add r to tail of queue
end
while queue is not empty do
begin
s = head of queue
remove head of queue
for each symbol a do
if (t = go_to(s,a)) != FAIL then
begin
add t to tail of queue
move(s,a) = t
end
else
move(s,a) = move(failure(s),a)
end
end
---------------------------------------------------------------
Algorithm 6: Merged Computation of Failure and Move Transitions
---------------------------------------------------------------
Input: Functions "go_to" and "output" from Algorithm 3.
Output: Function "move".
begin
initialize queue
for each symbol a do
begin
move(0,a) = go_to(0,a)
if (r = go_to(0,a)) != 0 then
begin
add r to tail of queue
failure(r) = 0
end
end
while queue is not empty do
begin
s = head of queue
remove head of queue
for each symbol a do
if (t = go_to(s,a)) != FAIL then
begin
add t to tail of queue
r = failure(s)
while go_to(r,a) == FAIL do
r = failure(r)
failure(t) = go_to(r,a)
output(t) = output(t) + output(failure(t))
move(s,a) = t
end
else
move(s,a) = move(failure(s),a)
end
end
=================================================================
Figure 1: Pattern Matching Machine
----------------------------------
!{h,s} (any symbol but "h" or "s")
+-----+
| V
| +---+ h +---+ e ***** r +---+ s *****
+---| 0 |---->| 1 |---->* 2 *---->| 8 |---->* 9 *
+-+-+ +-+-+ ***** +---+ *****
| |
| | i +---+ s *****
| +------>| 6 |---->* 7 *
| +---+ *****
|
| s +---+ h +---+ e *****
+------>| 3 |---->| 4 |---->* 5 *
+---+ +---+ *****
(a) Go_to function
Current State: 1 2 3 4 5 6 7 8 9
Failure State: 0 0 0 1 2 0 3 0 3
(b) Failure function
Current Output:
State:
0 --
1 --
2 {he}
3 --
4 --
5 {she,he}
6 --
7 {his}
8 --
9 {hers}
(c) Output function
Current Input Next Current Input Next
State: Symbol: State: State: Symbol: State:
0 h 1 4 e 5
s 3 i 6
. 0 h 1
1 e 2 s 3
i 6 . 0
h 1 6 s 7
s 3 h 1
. 0 . 0
2,5 r 8 8 s 9
h 1 h 1
s 3 . 0
. 0
3,7,9 h 4
s 3
. 0
where "." represents any symbol not included in the list of
symbols for the indicated state(s).
(d) Move function
=================================================================
References:
1. Aho, A.V. and Corasick, M.J. [1975]. Efficient String
Matching: An Aid to Bibliographic Search. CACM 18:6 (June),
pp.333-340.
2. Aho, A.V. and Ullman, J.D. [1977]. Principles of Compiler
Design. Addison Wesley, pp.73-124.
3. Aoe, J., Yamamoto, Y. and Shimada, R. [1984]. A Method for
Improving String Pattern Matching Machines. IEEE Transactions
On Software Engineering, SE-10:1 (January), pp.116-120.
4. Bell Telephone Laboratories [1983], UNIX Programmer's Manual
Volume 1, Holt, Rinehart and Winston, p.70-71.
5. Boyer, R.S. and Moore, J.S. [1977]. A Fast String Searching
Algorithm. CACM 20:10 (October), pp.762-772.
6. Knuth, D.E., Morris, J.H. and Pratt, V.R. [1977]. Fast Pattern
Matching in Strings, SIAM J. Comp. 6:2 (June), pp.323-350.
- End -
where "." represents any symbol not included in the list of
symbols for the indicated sta